International Journal of Epidemiology — Latest Matching Preprints

1

Breast cancer over-diagnosis due to mammography screening - A long-term follow-up population study of BreastScreen Norway

Heggland, T.; Vatten, L. J.; Opdahl, S.; Weedon-Fekjaer, H.

2026-06-03 epidemiology 10.64898/2026.06.02.26354696 medRxiv

Top 0.1%

38.0%

Objectives: Estimates of breast cancer over-diagnosis related to mammography screening vary substantially. Over-diagnosis is commonly defined as cases that would not have been detected during the person's remaining lifetime in the absence of screening. We here aim to quantify over-diagnosis in the population-based BreastScreen Norway mammography screening program using long-term follow-up and more detailed modeling than previous studies. Setting: We applied data on Norwegian screening patterns and breast carcinoma incidence for the period 1987-2019, covering women aged 49-84 years, leveraging the gradual implementation of the organized biennial BreastScreen Norway screening program for women aged 50-69 during 1995-2005. Methods: Using an extended age-period-cohort model, we estimated excess lifetime risk of invasive breast cancer and ductal carcinoma in situ in the presence of program screening, as an indicator of over-diagnosis among screen-detected cases. Results: Lifetime risk of breast carcinomas was 6.6% (95% confidence interval 2.5% to 10.7%) higher for invited than for non-invited women. This indicates that 18% (95% confidence interval 7.3% to 28.0%) of screen-detected cases may be over-diagnosed, and that approximately one in 86 (95% confidence interval 54 to 210) screened women were over-diagnosed during their screening period. Using effect estimates from previous studies, we estimated that approximately three women are over-diagnosed for every breast cancer death prevented by screening, and that 87% of over-diagnosed tumors might grow extremely slowly. Conclusions: Over-diagnosis related to mammography screening is a considerable problem, but its extent may be smaller than reported in some previous studies. Most over-diagnosed tumors likely grow very slowly.
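
The "one in 86" figure in the abstract above can be easier to compare with the other percentages once it is converted to a proportion. The following is a minimal sketch of that conversion only; the helper name `one_in_n_to_percent` is an illustrative assumption and is not part of the preprint.

```python
def one_in_n_to_percent(n: float) -> float:
    """Convert a 'one in n' frequency into a percentage."""
    return 100.0 / n


# Figures quoted in the abstract: one in 86 (95% CI 54 to 210).
point = one_in_n_to_percent(86)

# A larger n means a smaller risk, so the interval bounds swap order
# when moving from the 'one in n' scale to the percentage scale.
lower = one_in_n_to_percent(210)
upper = one_in_n_to_percent(54)

print(f"{point:.2f}% ({lower:.2f}% to {upper:.2f}%)")
```

Note that the interval bounds change places on the percentage scale, because a higher "one in n" value corresponds to a lower risk.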

2

A literature scanning and prioritization framework to guide future systematic reviews for World Cancer Research Fund International's Global Cancer Update Programme

Markozannes, G.; Jayedi, A.; Cividini, S.; Kazmi, S. Z.; Cariolou, M.; Vieira, R.; Pagkalidou, E.; Kiss, S.; Balducci, K.; Aune, D.; Gunter, M. J.; Cross, A. J.; Chan, D. S. M.; Tsilidis, K. K.

2026-05-08 epidemiology 10.64898/2026.05.07.26352530 medRxiv

Top 0.1%

23.7%

Background: The 2018 World Cancer Research Fund (WCRF)/American Institute for Cancer Research Third Expert Report (TER) on diet, adiposity, physical activity and risk of 19 cancers could be enhanced with new data. A framework is needed to prioritize future systematic reviews. Methods: We searched PubMed (January 2019-February 2024) for meta-analyses, pooled analyses, randomized controlled trials (RCTs), Mendelian randomization (MR) studies, and large (>100,000 participants) cohort studies. We assessed TER findings using conditional power (CP) and fail-safe number (FSN) statistics. We developed an exposure-based prioritization score (PS) by awarding or subtracting points considering the quantity, statistical significance, direction, and novelty of associations. Results: We compared 366 meta-analyses, 121 pooled analyses, 19 RCTs, 174 MR studies, and 391 cohort studies covering 151 exposures and 28 cancers with 1,371 TER meta-analyses. Based on CP, non-significant TER associations likely to become significant with additional evidence included folate and colorectal, waist circumference and lung, total fat and ovarian, tea and ovarian, and red meat and kidney cancers. The FSN indicated that most significant TER associations are unlikely to change with additional evidence. The median PS was 6 (range: -15 to 163), with top scores observed for anthropometric measurements (PS_height=40 to PS_BMI=163), physical activity (PS=100), sedentary behavior (PS=64), alcohol (PS=52), tea (PS=36), dietary fiber (PS=31), milk/dairy (PS=29), micronutrients (PS_retinol=27 to PS_iron=38), vitamins (PS_B12=22 to PS_vitD=91), soy (PS=24), isoflavones (PS=23), and sugar-sweetened beverages (PS=22). Conclusions and Impact: The prioritization framework can help identify impactful systematic reviews to complement TER conclusions and enhance our understanding of emerging research.

3

Ethnic Differences in the Timing and Incidence of Childhood Health Conditions: Evidence from the Born in Bradford Cohort

Santorelli, G.; Cheung, R. W.; Bhopal, S.; Wright, J.

2026-04-01 epidemiology 10.64898/2026.03.31.26349839 medRxiv

Top 0.1%

22.9%

Objective: To examine ethnic differences in the incidence and age-related trajectories of childhood health conditions from birth to adolescence within a UK birth cohort. Design: Longitudinal population-based birth cohort with linkage to primary care electronic health records. Setting: Born in Bradford (BiB), a multi-ethnic birth cohort in Bradford, UK. Participants: 13,282 children (36% White British, 44% Pakistani British, 20% other ethnicity) born 2007 to 2011 with linked primary care records and over 1 year of follow-up. Main outcome measures: Incident diagnoses of atopic conditions (asthma, eczema, allergic rhinoconjunctivitis), overweight/obesity, common mental health disorders (anxiety, depression), and neurodevelopmental disorders (including ADHD and autism). Incidence rates, Kaplan-Meier cumulative incidence, and Cox regression hazard ratios (HRs) were estimated. Results: Atopic conditions emerged early (median onset 5 to 6 years) and were more common among Pakistani British children, with higher hazards of eczema (HR 2.29, 95% CI 2.01 to 2.61), allergic rhinoconjunctivitis (HR 2.27, 2.00 to 2.58), and asthma (HR 1.35, 1.22 to 1.50). Overweight/obesity developed later (median 9 to 10 years) and was also more frequent in Pakistani British children (HR 1.25, 1.16 to 1.35). In contrast, common mental health disorders emerged predominantly in early adolescence (median around 13 years), and both mental health and neurodevelopmental diagnoses were more frequently recorded among White British children; Pakistani British children had lower hazards of neurodevelopmental diagnoses (HR 0.28, 0.23 to 0.35) and mental health disorders (HR 0.53, 0.41 to 0.70). Conclusions: Ethnic differences in childhood health are condition-specific and vary by age of onset, emerging at distinct stages. These findings inform the timing of prevention, service planning, and research into underlying mechanisms.

4

The Robust Bidirectional Association Between Chronic Lung Disease and Incident Osteoporosis: A Two-Stage Individual Participant Data Meta-Analysis of Three International Longitudinal Cohorts (HRS, SHARE, and ELSA)

Jiang, D.; Bao, J.

2026-03-19 respiratory medicine 10.64898/2026.03.18.26348689 medRxiv

Top 0.1%

22.5%

Background: The association between chronic lung disease (CLD) and osteoporosis (OP) is well-recognized, but the direction and magnitude of this relationship remain debated, particularly in aging populations. We aimed to quantify the bidirectional association between CLD (including COPD and asthma) and incident OP using a two-stage individual participant data (IPD) meta-analysis of three large longitudinal cohorts. Methods: We harmonized and analyzed individual-level data from the Health and Retirement Study (HRS, USA), the Survey of Health, Ageing and Retirement in Europe (SHARE, Europe), and the English Longitudinal Study of Ageing (ELSA, UK), all comprising adults aged ≥50 years. In the first stage, Cox proportional hazards models were fitted separately in each cohort to estimate hazard ratios (HRs) for the forward (CLD→OP) and reverse (OP→CLD) associations, adjusting for a comprehensive set of confounders (demographics, lifestyle, comorbidities, functional status). In the second stage, cohort-specific log HRs were pooled using fixed-effect meta-analysis. Heterogeneity was assessed with the I-squared statistic. Results: A total of 40,050 participants were included across the three cohorts. The pooled HR for incident OP among individuals with baseline CLD was 1.37 (95% confidence interval [CI] 1.24-1.51), with similar estimates for COPD (HR 1.47, 95% CI 1.27-1.69) and asthma (HR 1.35, 95% CI 1.22-1.50). For the reverse association, baseline OP was associated with increased risk of incident CLD (pooled HR 1.16, 95% CI 1.05-1.29), COPD (HR 1.28, 95% CI 1.11-1.47), and asthma (HR 1.17, 95% CI 1.05-1.30). Heterogeneity was low across all analyses (I² ≤ 7.5%). Conclusion: This two-stage IPD meta-analysis provides robust evidence of a bidirectional relationship between CLD and OP in older adults. These findings underscore the need for integrated screening and management of both conditions in aging populations.
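
The second stage described in the abstract above, pooling cohort-specific log hazard ratios with a fixed-effect model and summarising heterogeneity with I-squared, can be sketched as follows. This is a generic inverse-variance illustration, not the authors' code: the three cohort estimates are hypothetical placeholders (the abstract reports only pooled values), and the function names are assumptions made for this example.

```python
import math

Z_95 = 1.959964  # normal quantile for a two-sided 95% interval


def se_from_ci(ci_low: float, ci_high: float) -> float:
    """Recover the standard error of a log HR from its 95% CI."""
    return (math.log(ci_high) - math.log(ci_low)) / (2 * Z_95)


def pool_fixed_effect(estimates):
    """Inverse-variance fixed-effect pooling.

    estimates: list of (hr, ci_low, ci_high) tuples, one per cohort.
    """
    log_hrs = [math.log(hr) for hr, _, _ in estimates]
    weights = [1 / se_from_ci(lo, hi) ** 2 for _, lo, hi in estimates]
    w_sum = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, log_hrs)) / w_sum
    se = math.sqrt(1 / w_sum)

    # Cochran's Q and the I-squared statistic (truncated at zero).
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, log_hrs))
    df = len(estimates) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    return {
        "hr": math.exp(pooled),
        "ci_low": math.exp(pooled - Z_95 * se),
        "ci_high": math.exp(pooled + Z_95 * se),
        "i2": i2,
    }


# Hypothetical cohort-level estimates, for illustration only.
cohorts = [(1.40, 1.20, 1.63), (1.33, 1.12, 1.58), (1.38, 1.10, 1.73)]
result = pool_fixed_effect(cohorts)
print(result)
```

The pooled estimate is a precision-weighted average on the log scale, so it always lies between the smallest and largest cohort-specific hazard ratios.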

5

Whom Does Algorithmic Risk Stratification Miss? A Fairness Audit of Machine Learning Targeting for Concurrent Maternal-Child Double Burden of Malnutrition Across 30 Low- and Middle-Income Countries

WU, X.; Zheng, B.

2026-04-30 epidemiology 10.64898/2026.04.28.26352000 medRxiv

Top 0.1%

22.1%

Background: Concurrent maternal-child double burden of malnutrition (DBM) affects a growing share of mother-child dyads in low- and middle-income countries (LMICs). Nutrition programmes often use maternal education as an eligibility proxy, but whether algorithmic alternatives would do better--and at what equity cost--has not been directly tested. We evaluated whether machine learning (ML)-based targeting for two concurrent DBM subtypes--overweight mother with stunted or wasted child (Subtype A) and underweight mother with stunted or wasted child (Subtype B)--improves recall over a proxy-based rule while preserving fairness across social strata. Methods: We pooled Phase 7-8 Demographic and Health Surveys from 30 LMICs (181,636 mother-child dyads). We first estimated subtype-specific social gradients with multilevel logistic regression. We then trained xgboost prediction models with strict label-leakage safeguards and leave-one-country-out cross-validation, and compared ML-based targeting against random and education-based rules at 10%, 20%, and 30% budget constraints. Fairness was audited along six social strata using equalized-odds, demographic-parity, calibration, and predictive-value gaps. A full-India sensitivity analysis (354,691 dyads) assessed robustness to down-sampling. Findings: Overall weighted any-DBM prevalence was 12.52% (Subtype A: 8.21%; Subtype B: 4.31%). Subtype A showed an inverted-U gradient on wealth (adjusted odds ratio peak 1.22 at Richer versus Poorest) and maternal education (peak 1.23 at Primary versus None); Subtype B declined monotonically (wealth: 0.25 at Richest; higher education: 0.49). Mean leave-one-country-out area under the curve was 0.615 for Subtype A and 0.652 for Subtype B. At a 20% budget, ML captured 35.3% of Subtype A cases versus 18.4% for education-based targeting (+92%); for Subtype B the corresponding values were 37.5% and 32.1% (+17%).
Equalized-odds gaps reached 0.57 on country income and 0.59 on maternal education; true-positive rates were lowest in the highest-wealth and highest-education strata. Results were stable under the full-India sensitivity analysis. Conclusions: ML is useful principally for Subtype A, where the education proxy is no better than random. For Subtype B it mostly changes who gets reached rather than how many, which is a policy choice rather than an accuracy upgrade. The households the algorithm most often misses are not the poor but the rare positives in high-resource strata, which is what a fixed-budget rule ranking on heterogeneous base rates will do. Programmes should decide whether their priority is total capture or the distribution of capture before adopting such a rule. Author Summary: In many low- and middle-income countries, mothers who are overweight often live in the same household as children who are too short or too thin for their age. Nutrition programmes that try to reach such families have limited resources, so they must choose which households to prioritise. Most programmes use maternal education level as a rough filter, but whether this is actually a good way to find affected families has rarely been tested. We used surveys of 181,636 mother-child pairs from 30 low- and middle-income countries to compare three ways of identifying at-risk households: random selection, selection by low maternal education, and selection by a machine-learning model. Machine learning was much better at finding families where an overweight mother lives with an undernourished child--nearly doubling the capture rate compared with the education rule. For a different combination (underweight mother with an undernourished child), machine learning did not clearly outperform education on total recall; instead it reached different households, mostly shifting attention toward the rural poor.
An unexpected finding was that the households the algorithm was most likely to miss were not the poor ones, but the wealthier and better-educated ones, where this type of malnutrition is rarer. This is not bias against the poor--it is what happens when any ranking rule operates under a fixed budget. Programmes that want to reach everyone at risk, regardless of how rare risk is in a given group, may need more than one rule.
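
The "capture at a fixed budget" comparison in the abstract above can be illustrated with a short sketch: rank households by a score, select the top share allowed by the budget, and report the fraction of true cases reached. The toy scores and labels are invented for illustration, and the function names are assumptions; only the four capture percentages passed to `relative_gain` come from the abstract.

```python
def capture_rate(scores, labels, budget):
    """Share of true cases found among the top `budget` fraction by score."""
    n_selected = int(len(scores) * budget)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    captured = sum(label for _, label in ranked[:n_selected])
    total = sum(labels)
    return captured / total if total else 0.0


def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100


# Toy example (invented data): 10 households, 3 true cases, 20% budget.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
toy = capture_rate(scores, labels, 0.20)

# Capture percentages at a 20% budget, as quoted in the abstract.
gain_a = relative_gain(35.3, 18.4)  # Subtype A: ML vs education rule
gain_b = relative_gain(37.5, 32.1)  # Subtype B: ML vs education rule
print(f"toy capture: {toy:.2f}, gain A: {gain_a:.0f}%, gain B: {gain_b:.0f}%")
```

Because the budget fixes how many households are selected, a group with a low base rate contributes few high-ranked cases, which is consistent with the abstract's point about rare positives in high-resource strata.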

6

Analytic Choices Shape Genomic Risk Estimates from Electronic Health Records: Coronary Heart Disease in eMERGE IV

Chen, J. H.; Knerr, S. A.; Veenstra, D. L.; Abul-Husn, N. S.; Hanks, S. C.; Kenny, E. E.; Limdi, N. A.; Cortopassi, J. B.; Crosslin, D.; Jarvik, G. P.; Kullo, I. J.

2026-04-30 epidemiology 10.64898/2026.04.28.26352002 medRxiv

Top 0.1%

18.5%

Background: Electronic health records (EHR) are an important data source for genomic studies, but challenges exist in ascertaining cases and observation start time. We used data derived from the Electronic Medical Records and Genomics (eMERGE) IV study to examine how analytic assumptions regarding case ascertainment and EHR entry time influence estimation of monogenic and polygenic risks for coronary heart disease (CHD). Methods: We assessed agreement between CHD cases ascertained from EHR phenotyping and survey. Associations of monogenic variants and high (top 5%) PRS of CHD were evaluated using multivariate relative risk (RR) regression under three alternative case definitions: EHR-algorithm-defined, self-reported, and combined. Time-to-event analyses (Kaplan-Meier method and Cox proportional hazards models) were conducted under three entry time specifications: (1) entry at the first EHR record, (2) entry at the start of the latest consecutive observation period, and (3) no left truncation. Results: The agreement between CHD cases ascertained by the EHR-based algorithm versus self-report was 37.2% among individuals identified as cases by at least one source, with the EHR algorithm demonstrating higher accuracy. The adjusted RR [95% confidence interval (CI)] associated with high PRS was 2.05 [1.50, 2.81] for EHR-defined, 1.49 [1.04, 2.13] for self-reported, and 1.66 [1.27, 2.18] for combined CHD. Estimated cumulative incidence by age 75 was 0.188 using the first EHR code as left truncation and 0.225 using the most recent observation period. Hazard ratio (HR) estimates were similar across the three left truncation scenarios. Conclusion: The choice of case definition meaningfully influenced RR estimates, whereas alternative specifications of EHR entry time affected absolute cumulative incidence estimates but had minimal impact on HR. These findings highlight the impact of analytical choices in EHR and survey-data-based studies that are applicable beyond the context of CHD.

7

Data Resource Profile: EST-Health-30

Reisberg, S.; Oja, M.; Mooses, K.; Tamm, S.; Sild, A.; Talvik, H.-A.; Laur, S.; Kolde, R.; Vilo, J.

2026-04-24 epidemiology 10.64898/2026.04.21.26351087 medRxiv

Top 0.1%

17.3%

Background: The increasing availability of routinely collected health data offers new opportunities for population-level research, yet access to comprehensive, linked, and standardised datasets remains limited. We describe EST-Health-30, a large-scale, population-representative health data resource from Estonia. Methods: EST-Health-30 comprises a random 30% sample of the Estonian population (~500,000 individuals), with longitudinal data from 2012 to 2024 and annual updates planned through 2026. Individual-level records are linked across five nationwide databases, including electronic health records, health insurance claims, prescription data, cancer registry, and cause of death records. A privacy-preserving hashing approach ensures consistent cohort inclusion over time while maintaining pseudonymisation. All data are harmonised to the Observational Medical Outcomes Partnership (OMOP) Common Data Model (version 5.4) using international standard vocabularies. Data quality was assessed using established OMOP-based validation frameworks. Results: The dataset contains rich multimodal information on diagnoses, procedures, laboratory measurements, prescriptions, free-text clinical notes, healthcare utilisation, and costs, with high population coverage and longitudinal depth. Data quality assessment showed high completeness and consistency, with 99.2% of applicable checks passing. The age-sex distribution closely reflects the national population, supporting representativeness, though coverage is marginally below the target 30% (29.2%), primarily attributable to recent immigrants without health system contact. The dataset enables construction of detailed clinical cohorts, analysis of disease trajectories, and evaluation of healthcare utilisation and outcomes across the life course. Conclusions: EST-Health-30 is a comprehensive, standardised, and population-representative real-world data resource that supports epidemiological, clinical, and methodological research.
Its alignment with the OMOP CDM facilitates reproducible analytics and participation in international federated research networks, while secure access infrastructure ensures compliance with data protection regulations. Key features: EST-Health-30 is a population-representative dataset of complete health records for a random 30% sample of the Estonian population (~500,000 individuals) spanning 2012-present, enabling population-level epidemiological analyses with annual updates. The dataset is constructed using a random sampling approach based on hashed password-protected personal identifiers, ensuring consistent inclusion over time with unbiased population coverage. Individual-level data are linked across multiple nationwide databases, including electronic health records, claims, prescriptions, cancer and cause of death registry data, enabling multimodal analyses of health trajectories. All data are standardised to the OMOP Common Data Model (CDM) version 5.4 using international vocabularies (e.g., SNOMED CT, RxNorm, LOINC), supporting reproducibility and participation in federated research networks. The dataset is accessible through a secure processing environment compliant with the European Health Data Space (EHDS) framework.

8

Early life multidimensional disadvantage of South Australian children: a whole-population linked data study

Kalamkarian, A.; Pilkington, R. M.; Lynch, J.; Mittinty, M. N.; Malvaso, C.; Hawkins, K.; Pharo, H.; Beck, K.; Chittleborough, C. R.

2026-06-05 epidemiology 10.64898/2026.06.03.26354860 medRxiv

Top 0.1%

17.1%

Background: Whole-population linked administrative data platforms provide an opportunity to generate evidence on early life multidimensional disadvantage to inform resourcing and service provision to families with complex needs. Methods: We used individual-level de-identified data from nine administrative data sources included in the Better Evidence Better Outcomes Linked Data (BEBOLD) platform. The population included all children born in South Australia between 2004 and 2011 (n=143,083), and their parents. We described the prevalence and distribution of multiple disadvantages affecting children from the 12 months before birth to age 5. Eleven domains of parental disadvantage were created: economic, education, access to services, mental health, substance misuse, smoking during pregnancy, domestic and family violence, health, child protection contact, justice system contact, and death. We investigated the concordance of our measure with an area-level socioeconomic measure used in government reporting. Results: One in two children (48%) were exposed to at least one disadvantage domain, and one in seven (14%) were exposed to three or more domains before age five. Economic disadvantage was most prevalent, affecting one in four (27%) children, of whom 75% were exposed to additional forms of disadvantage. Substance misuse, domestic and family violence, and justice system contact were the least likely domains to occur in isolation. Only 54.4% of children who experienced five or more disadvantage domains were classified in the area-level socioeconomic measure's 'most disadvantaged' quintile. Conclusion: Early life exposure to parental disadvantage can be highly multidimensional. Measurement across different systems is important for informing coordinated service provision for families with complex needs.

9

Methodological Considerations in Sibling Analyses of Prenatal Acetaminophen

Ahlqvist, V. H.; Sjoqvist, H.; Gardner, R. M.; Lee, B. K.

2026-03-30 epidemiology 10.64898/2026.03.27.26349515 medRxiv

Top 0.1%

15.0%

Background: Sibling-matched designs control for shared familial confounding but remain vulnerable to non-shared confounders. Bi-directional sensitivity analyses, which stratify families by whether the older or younger sibling was exposed, are commonly used to assess carryover effects. We aimed to demonstrate how this methodological approach can introduce severe confounding by parity. Methods: We conducted simulations motivated by a recent epidemiological study. The true causal effect of a hypothetical exposure (prenatal acetaminophen) on neurodevelopmental outcomes was set to strictly null. To introduce parity-related confounding, baseline exposure and outcome probabilities were varied slightly by birth order. We compared conditional logistic regression effect estimates from total sibling models against bi-directional stratified models. Results: In the total simulated sibling cohort, models yielded the true null effect (odds ratio = 1.00) when adjusting for parity. However, the bi-directional analyses exhibited divergent artifactual signals. Because parity is perfectly collinear with exposure in these stratified subsets, it cannot be adjusted for. For example, when the older sibling was exposed, the odds ratio for autism spectrum disorder was 1.68; when the younger was exposed, the odds ratio was 0.60. Conclusions: Divergent estimates in bi-directional sibling analyses can be a predictable artifact of parity confounding rather than evidence of carryover effects or invalidating unmeasured bias. Overall sibling models adjusting for parity may remain robust despite divergent stratified sensitivity results.
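
The mechanism described in the abstract above can be illustrated without a full simulation. For 1:1 matched pairs, the conditional logistic odds ratio equals the ratio of discordant pairs: pairs where only the exposed sibling has the outcome, divided by pairs where only the unexposed sibling does. In a stratum where the exposed sibling is always the older one, that ratio reflects birth-order differences in baseline risk even when the exposure has no effect. The sketch below assumes outcomes are independent within a pair and uses hypothetical baseline risks; these are not the preprint's simulation parameters, and were picked only so the output is of similar size to the odds ratios quoted.

```python
def matched_pair_or(p_exposed_sib: float, p_unexposed_sib: float) -> float:
    """Expected conditional OR from exposure-discordant sibling pairs.

    Assumes a truly null exposure effect and independent outcomes within
    each pair, so outcome risk depends only on birth order.
    """
    only_exposed = p_exposed_sib * (1 - p_unexposed_sib)
    only_unexposed = p_unexposed_sib * (1 - p_exposed_sib)
    return only_exposed / only_unexposed


# Hypothetical baseline outcome risks by birth order (illustration only).
P_FIRSTBORN = 0.020
P_LATER_BORN = 0.012

# Stratum 1: the older sibling is the exposed one in every pair.
or_older_exposed = matched_pair_or(P_FIRSTBORN, P_LATER_BORN)

# Stratum 2: the younger sibling is the exposed one in every pair.
or_younger_exposed = matched_pair_or(P_LATER_BORN, P_FIRSTBORN)

print(f"older exposed: {or_older_exposed:.2f}, "
      f"younger exposed: {or_younger_exposed:.2f}")
```

Under these assumptions the two stratum-specific odds ratios are exact reciprocals of each other, which matches the pattern of divergent estimates on either side of the null that the abstract reports.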

10

First-time child protection contacts from 0 to 15 years in a whole-population cohort of Australian Aboriginal children born 2006-2020: a data linkage study

Hanly, M. J.; Newton, B.; Ahmed, T.; Payne, T.; Powell, M.; Cripps, K.; Katz, I.; Pilkington, R.; Lynch, J.; Gray, P.; Falster, K.

2026-03-26 epidemiology 10.64898/2026.03.24.26349231 medRxiv

Top 0.1%

14.9%

Background: First Nations children are over-represented in child protection systems in Australia and other colonised countries. Here, we apply a prevention and equity lens to the use of child protection data, to inform early opportunities to support Aboriginal children and families at risk of escalating child protection contact, from pregnancy to adolescence. Methods: We followed 15 whole-population cohorts (born 2006-2020) of Aboriginal (n=119,716) and non-Aboriginal (n=1,456,698) children in New South Wales (NSW), Australia, to December 2021, using birth and child protection datasets linked for the NSW Child E-Cohort. In each Aboriginal and non-Aboriginal cohort (2006-2020), we calculated the cumulative incidence (risk) of first-time child protection contacts from the prenatal period up to age 15 years: child concern reports, screened-in reports, investigations, child protection-defined substantiations, and OOHC placements. Risk differences and relative risks were also calculated. Findings: By birth, 10-15% of Aboriginal children born 2006-2020 had a first report to child protection, with 48-54% by age 5y (2006-2016 births), and 74% by age 15y (2006 births), with similar risks of screened-in reports (e.g. 68% by age 15y). The risk of first-time substantiation was 1-5% of Aboriginal children by birth, 17-20% by 5y, and 32% by 15y, with higher risks in more contemporary cohorts. By age 1y, 3-4% of Aboriginal children born 2006-2020 had a first OOHC placement, with 7-9% by 5y, and 14% by 15y. The risk differences between Aboriginal and non-Aboriginal children were 23 and 3 percentage points for reports and OOHC by age 1y (2020 births), respectively, increasing as children age. Interpretation: Despite extensive inquiries, calls for prevention and Closing the Gap targets, our study shows the lifetime risk of child protection involvement for Aboriginal families has not improved and inequities persist.
These findings support the call for Aboriginal-led approaches and greater investment in early supports for First Nations children and families. Research in Context. Evidence before this study: We searched PubMed and Medline for studies on the lifetime risk of child protection contacts among First Nations child populations, published January 2005 to May 2025. Thirteen studies reported various child protection contacts, from the perinatal period through childhood, among birth or synthetic cohorts of First Nations children, born between 1990 and 2018, created from population data sources in jurisdictions in Australia (n=5), the United States (US) (n=6), and Aotearoa/New Zealand (NZ) (n=2) (Table E1). [Table E1. Systematic Review Results: Details of 13 studies on the lifetime risk of child protection contacts among First Nations child populations, published January 2005 to May 2025.] The most recently published study included First Nations children born 2000 to 2013 in Western Australia, which quantified the risk of reports, investigations, substantiations and removals into OOHC, from age 1 to 16 years. By age 1, 12% were reported and 3% were removed into OOHC. By age 16, 52% were reported, and 14% were removed into OOHC. Prior studies of birth or synthetic cohorts of First Nations children born 1990-2018, in the USA, NZ, and South Australia showed similar results. By age 5 years, risks were 16% to 54% for reports, 20% for investigations, 7% to 11% for substantiations and 8% for removals into OOHC. Among the five studies with cohorts followed to 18 years, 42% were reported, 28% to 50% were investigated, 9% to 27% were substantiated, 7% to 16% were removed into OOHC and 0.8% to 3.8% had termination of parental rights.
Added value of this study: This is the largest and most contemporary study to quantify the lifetime risk of child protection contact among whole-populations of First Nations children internationally. Among 15 consecutive whole-population cohorts of First Nations children in New South Wales (NSW), Australia, born 2006 to 2020, we reported--for the first time--the full spectrum of child protection contacts, from the prenatal period. By birth, 16% were reported to child protection, 14% were investigated and 5% were substantiated in the most contemporary cohort born 2020. By age 1 year, 2.8% were removed into OOHC. In the oldest cohort born 2006, 74% were reported and 14.4% removed into OOHC by age 15 years. We also reveal the magnitude of the inequity in child protection contacts between First Nations and non-Indigenous children across the lifecourse. For example, among 2006 births, the risk of first-time reports to child protection for Aboriginal and non-Aboriginal children, respectively, was 10.5% versus 1.5% by birth (risk difference (RD), 9 percentage points; risk ratio (RR), 7.0), 53% vs 16% by age five (RD, 38pp; RR, 3.4) and 74% vs 33% by age 15 (RD, 41pp; RR 2.2). Implications of all the available evidence: This study unequivocally shows that the lifetime risk of child protection involvement in the lives of First Nations families has not reduced in more contemporary whole-population cohorts and that inequities persist. This is consistent with evidence from prior studies internationally. It is critical that First Nations-led responses and investment in early family supports be at the centre of system reform to realise the long-called-for shift toward prevention and to redress the pervasive inequities experienced by First Nations children and families in colonised countries such as Australia.
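
The risk differences and risk ratios quoted for 2006 births can be recomputed from the percentages given in the text. This is a minimal sketch with assumed function names. Because the inputs are rounded percentages, the recomputed values can differ slightly from the published ones (for example, at age five), which were presumably derived from unrounded risks.

```python
def risk_difference_pp(risk_a: float, risk_b: float) -> float:
    """Absolute difference between two risks, in percentage points."""
    return risk_a - risk_b


def risk_ratio(risk_a: float, risk_b: float) -> float:
    """Ratio of two risks (both given on the same scale)."""
    return risk_a / risk_b


# (time point, Aboriginal %, non-Aboriginal %) for first-time reports,
# 2006 births, as quoted in the abstract.
comparisons = [
    ("by birth", 10.5, 1.5),
    ("by age 5", 53.0, 16.0),
    ("by age 15", 74.0, 33.0),
]

results = {
    label: (risk_difference_pp(a, b), risk_ratio(a, b))
    for label, a, b in comparisons
}

for label, (rd, rr) in results.items():
    print(f"{label}: RD = {rd:.1f} pp, RR = {rr:.1f}")
```

The pattern is the one the abstract highlights: the absolute gap widens with age while the relative gap narrows, because the comparison group's risk rises from a much lower base.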

11

A New Mixed Frequency Regression Model For Environmental Epidemiology

Shukla, N.; Bartington, S. E.; Hansell, A. L.; Lucas, T. C.

2026-06-04 epidemiology 10.64898/2026.06.03.26354801 medRxiv

Top 0.1%

14.8%

Background: In the absence of high-resolution response data, exposure-response modelling often relies on aggregated low-frequency exposure data, leading to loss of high-resolution information. Mixed Data Sampling (MIDAS) from econometrics offers an alternative but is limited by its inability to make high-resolution predictions, its inflexible likelihoods and penalised nonlinear functions, and its limited visualization options. We propose a mixed-frequency Distributed Lag Non-linear Model (mf-DLNM) which can eliminate the need to aggregate exposure data in environmental epidemiology and provide high-resolution predictions for time series studies. Methods: We evaluated the inference and predictive performance of the mf-DLNM. To evaluate its ability to estimate exposure-response relationships, we applied mf-DLNM and same-frequency (sf)-DLNM using data from the West Midlands, UK. Additionally, we compared the predictive performance of mf-DLNM with sf-DLNM and MIDAS across nine regions of England. As MIDAS cannot predict at the resolution of the predictor (daily), we compared the predictive performance of mf-DLNM and MIDAS at weekly resolution. To test the model's ability to predict high temporal resolution risk (daily), we compared sf-DLNM (with access to daily mortality counts) with mf-DLNM (with access only to weekly mortality counts). Results: In the West Midlands example, mf-DLNM performed comparably to sf-DLNM in estimating the daily effect of temperature on respiratory mortality risk. Furthermore, mf-DLNM and MIDAS exhibited similar performance for weekly predictions. For high-resolution predictions, mf-DLNM and sf-DLNM showed similar performance, despite mf-DLNM having access only to low-resolution response data.
Conclusion: This mixed-frequency approach in environmental epidemiology overcomes the limitations of predicting health risks using aggregated exposure data and provides estimates of high-resolution outcomes in the absence of high-frequency health outcome datasets.

12

Mapping the Dynamic Interplay of Mental Health and Weight Across Childhood: Data-Driven Explorations Using Causal Discovery

Larsen, T. E.; Lorca, M. H.; Ekstrom, C. T.; Vinding, R.; Bonnelykke, K.; Strandberg-Larsen, K.; Petersen, A. H.

2026-04-17 epidemiology 10.64898/2026.04.16.26350943 medRxiv

Top 0.1%

14.7%

Childhood weight development, especially overweight and obesity, has been associated with mental health, but their dynamic, causal relationships, and whether these differ by sex, remain unclear. We applied causal discovery to data from the Danish National Birth Cohort (n=67,593) spanning six periods from pregnancy to late adolescence and considering 67 variables related to child and parental weight, mental health, lifestyle, and socio-economic factors. We found no statistically significant difference between the causal graphs for boys and girls (P=0.079). The data-driven models found causal influence of childhood weight on subsequent weight status. Mental health pathways were exclusively within or across adjacent periods and centered on early adolescent stress. We examined the interplay between a subset of mental health variables, containing information on externalizing and internalizing problems, and weight, and found no direct causal pathway between the two processes. These findings suggest that observed links between weight and these mental health measures may be attributable to confounding. Our findings demonstrate the value of data-driven causal discovery in large cohort studies and how to test for differences in causal mechanisms across subgroups. Results are available in an interactive application, enabling future research to further explore the interplay between weight and mental health.

13

Using human genetic variation to estimate the effect of lipoprotein(a) lowering on pregnancy outcomes

Urquijo, H.; Goldfine, A. B.; Casas, J. P.; Xu, H.; Timsit, Y. E.; Mendelson, M. M.; Hache, C.; Jones, I.; Arustamian, D.; Magnus, M. C.; Gaunt, T. R.; Lawlor, D. A.; Borges, M. C.

2026-05-20 epidemiology 10.64898/2026.05.18.26351595 medRxiv

Top 0.1%

14.5%

Background: Lipoprotein(a) (Lp[a]) is a genetically determined causal and independent cardiovascular risk factor, and Lp(a)-targeted therapies are being developed. However, evidence on the safety of substantial Lp(a) lowering during pregnancy is limited. We evaluated the impact of Lp(a) lowering on adverse pregnancy and perinatal outcomes (APPOs) using human genetic evidence. Material and Methods: We applied a drug-target Mendelian randomization (MR) approach using genetic variants associated with Lp(a) in the UK Biobank at the LPA locus to proxy pharmacological Lp(a) lowering. Summary-level APPO data were obtained from the MR-PREG collaboration, comprising up to 714,899 women across multiple studies. Twenty APPOs were included. Sensitivity analyses included adjustment for fetal genotype, alternative Lp(a) datasets, leave-one-study-out analyses, and exploration of Lp(a) genetic scores and individuals homozygous for LPA loss-of-function variants in the UK Biobank. Results: Across 20 APPOs, MR estimates showed no strong evidence of causal effects, with no associations surviving false discovery rate P-value correction. Most estimates were close to null, including gestational hypertension, gestational diabetes, preeclampsia, miscarriage and neonatal intensive care unit admission. Some associations were slightly larger in magnitude but with wide confidence intervals: gestational age (mean difference 0.04 weeks, 95% CI 0.02-0.06 per 210 nmol/L reduction in Lp[a]) and congenital malformation (OR 0.82, 95% CI: 0.72-0.94) in the protective direction of effect, and higher odds of stillbirth (OR 1.09, 95% CI: 1.00-1.19) and low Apgar at 1 minute (OR 1.11, 95% CI: 0.99-1.24). Sensitivity analyses consistently supported the primary findings, with no evidence of increased maternal or offspring risk in analyses adjusting for maternal-fetal genotype, across alternative exposure datasets, or in leave-one-study-out tests.
Individual-level analyses of Lp(a) genetic score and LPA loss-of-function variants showed no associations, although power was limited. Conclusion: These findings suggest that substantial lowering of Lp(a) is unlikely to increase APPO risk, although modest effects, particularly for rare outcomes, cannot be excluded.

14

Mother-infant linked UK electronic birth cohorts representing 17.5 million births harmonised to the OMOP common data model

Seaborne, M.; Durbaba, S.; Mendez-Villalon, A.; Giles, T.; Gonzalez-Izquierdo, A.; Hough, A.; Sanchez-Soriano, C.; Snell, H.; Cockburn, N.; Nirantharakumar, K.; Poston, L.; Reynolds, R.; Santorelli, G.; Brophy, S.

2026-03-25 public and global health 10.64898/2026.03.23.26349078 medRxiv

Top 0.1%

14.4%

We describe the harmonisation of five UK electronic birth cohorts to the Observational Medical Outcomes Partnership (OMOP) Common Data Model, creating a large-scale, standardised resource for maternal and child health research. The Mother and Infant Research Data Analysis (MIREDA) partnership developed and implemented reproducible guidelines for mapping maternal-infant relationships and identifying pregnancy episodes within routinely collected healthcare data. Cohorts from England, Scotland, and Wales were transformed despite substantial heterogeneity in data structure, coding systems, and variable definitions. The resulting harmonised resource preserves each cohort as an independent dataset while enabling federated analyses to be conducted across sites without the need to share individual-level data. Collectively, the cohorts capture over 17.5 million live births, providing sufficient scale to investigate rare exposures and outcomes, support trial emulation, and evaluate population-level policy impacts across the UK. This article details the transformation pipeline and provides reusable methods to support extension to additional cohorts and networks. The harmonised datasets enable interoperable, reproducible research and facilitate cross-national comparative studies in maternal and child health.

15

Global variation in cardiometabolic risk structures: A 48-country comparative Bayesian network analysis in 146,000 participants using WHO STEPS data

Babagoli, M. A.; Beller, M. J.; Scutari, M.; Gonzalez-Rivas, J. P.; Noronha, J. C.; Medicine, A.; Sulbaran, N.; Cabrera, S. S.; Fallahzadeh, A.; Iruvanti, S.; Nieto-Martinez, R.; Mechanick, J. I.

2026-05-20 public and global health 10.64898/2026.05.15.26353288 medRxiv

Top 0.1%

14.2%

Background Cardiometabolic-based chronic disease (CMBCD) at an individual level results from complex interactions among a multi-tiered network of sociodemographic, behavioral, and metabolic factors. Though a consensus set of risk factors drives CMBCD, population context influences risk factor effects and interactions. To better understand this phenomenon, we investigated the multi-tiered networking of cardiometabolic variables across diverse populations using a comparative modelling approach. Methods and Findings Utilizing nationally representative cross-sectional data from 48 countries participating in the World Health Organization "STEPwise approach to noncommunicable disease risk factor surveillance" survey, we learned country-specific Bayesian networks including sociodemographic, behavioral, and cardiometabolic variables (adiposity, diabetes, hypertension, hyperlipidemia, and cardiovascular disease). By computing the structural Hamming distance between pairs of networks, we compared differences in network structures across regions and country income levels. We then used the learned networks to assess individual risk factor influences and interactions on cardiometabolic outcomes. Country-specific Bayesian networks varied in terms of the risk factors directly and indirectly associated with the cardiometabolic outcomes. Network structures differed significantly across regions (p = 0.023) but not across income levels (p = 0.91). These results were robust to an alternative learning algorithm, network comparison metric, and data imputation approach. Older age (60+ vs. 30-44 years old) was associated with a greater increase in probability of obesity in Europe and Central Asia (+80%) compared to other regions. 
Higher education was associated with increased probability of obesity (+53%), diabetes (+18%), and hypertension (+2%) in South Asia but decreased probability of obesity (-10%), diabetes (-32%), hypertension (-16%), and hyperlipidemia (-25%) in Middle East and North Africa. The interaction between age and sex in predicting obesity was significant in the highest proportion of countries in Europe and Central Asia compared to other regions. While this dataset provided standardized data across multiple countries to define cardiometabolic risk factors and drivers, there was limited data on certain health outcomes and uneven availability of data across regions. Conclusions These results revealed specific regional patterns of multi-tiered cardiometabolic risk structures, emphasizing the need for regionally tailored public health strategies rather than applying generalized consensus evidence-based models. Future research should explore the structural drivers of regional differences in inter-relationships of cardiometabolic risk factors, drivers, and disease.

16

Neonatal mortality risk of large-for-gestational age and macrosomic live births in low- and middle-income subnational birth cohorts: An individual participant meta-analysis (2000-2017)

Kirakoya Samadoulougou, F.; Barche, B.; Ukwishaka, J.; Subedi, S.; Erchick, D. J.; Suarez Idueta, L.; Hamer, D. H.; Semrau, K. E. A.; Hamomba, F. M.; Banda, B.; Manasyan, A.; Pry, J. M.; Maleta, K.; Ashorn, U.; Schmiegelow, C.; Hjort, L.; Minja, D. T. R.; Lusingu, J. P. A.; Freitas da Silveira, M.; Buffarini, R.; Baqui, A. H.; Khanam, R.; Ahmed, S.; Zhu, Z.; Zeng, L.; Cheng, Y.; Lachat, C.; Roberfroid, D.; Huybregts, L.; Toe, L. C.; Tielsch, J. M.; Khatry, S. K.; Mullany, L. C.; Ohuma, E. O.; Blencowe, H.; Katz, J.; Lee, A. C. C.; Black, R. E.; Hazel, E. A.

2026-06-06 public and global health 10.64898/2026.06.03.26354851 medRxiv

Top 0.1%

13.9%

Background Large-for-gestational-age (LGA) and macrosomic newborns are at increased risk of adverse perinatal outcomes, including death, yet the burden of neonatal mortality associated with these conditions in low- and middle-income countries (LMICs), where ongoing nutritional and epidemiological transitions suggest their prevalence will rise, remains poorly quantified. In this study, we quantify the neonatal mortality risk associated with LGA and macrosomia from 16 subnational birth cohorts in low- and middle-income countries between 2000 and 2017. Methods and findings This is an individual-participant meta-analysis to estimate neonatal mortality rates (NMRs) and relative risks among LGA infants (>90th and >97th percentile birth weight-for-gestational-age using INTERGROWTH-21st) versus appropriate-for-gestational-age (AGA, 10th-90th percentile) infants. Macrosomic (≥4000 g and ≥4500 g) neonates were compared with those weighing 2500-3999 g. Missing birth weights were imputed using recalibration and multiple imputation methods. We used random effects meta-analysis to pool relative risks. Median prevalences of LGA >90th and >97th percentile were 5.3% (interquartile range 3.6-8.2) and 2.6% (IQR 1.3-4.5), respectively; macrosomia (≥4000 g and ≥4500 g) prevalences were 1.0% (IQR 0.3-3.1) and 0.06% (IQR 0.0-0.30), respectively. Mortality was highest among preterm plus LGA infants (61.3 per 1000). LGA infants in the >90th percentile had over twofold increased mortality compared with appropriate-for-gestational-age infants (RR: 2.46; 95% CI: 1.86-3.25), while >97th percentile infants had a higher risk (RR: 3.77; 95% CI: 2.50-5.69). Term LGA >97th percentile infants also showed elevated mortality (RR: 3.14; 95% CI: 1.58-6.22). For LGA >97th percentile, the risk was higher in the early neonatal period (RR: 2.71; 95% CI: 1.92-3.82) than late (RR: 1.69; 95% CI: 1.22-2.34). There was no overall association between macrosomia (≥4000 g) and neonatal mortality.
Population attributable fractions were 7.2% for LGA >90th percentile and 0.4% for macrosomia (≥4000 g). Conclusions Neonatal mortality risks were elevated among LGA infants in low- and middle-income countries, particularly at extreme values (>97th percentile) and during the early neonatal period. Macrosomia showed weaker, less robust associations. Although LGA prevalence is currently low (~5%) and contributes less to neonatal mortality than small newborns, ongoing nutritional and epidemiological transitions suggest increasing prevalence. This highlights the need for strengthened surveillance, monitoring, and improved delivery planning to ensure that no population is left behind.
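
As a rough illustration of how a population attributable fraction relates prevalence and relative risk, the sketch below applies Levin's formula, PAF = p(RR - 1) / (1 + p(RR - 1)), to the median LGA >90th percentile prevalence (5.3%) and the pooled relative risk (2.46) quoted in the abstract. The choice of Levin's formula and the function name `levin_paf` are assumptions made for illustration; the abstract does not state how the authors computed their attributable fractions.

```python
def levin_paf(prevalence: float, relative_risk: float) -> float:
    """Levin's population attributable fraction (assumed formula, not
    confirmed as the method used in the preprint)."""
    excess = prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)


# Inputs taken from the abstract: median prevalence of LGA >90th percentile
# and the pooled relative risk of neonatal mortality versus AGA infants.
paf_lga90 = levin_paf(prevalence=0.053, relative_risk=2.46)
print(f"PAF for LGA >90th percentile: {paf_lga90:.1%}")
```

Under these assumptions the result is close to the 7.2% reported in the abstract, although combining a median prevalence with a pooled relative risk is only an approximation.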

17

Incremental Clinical Value of Single-Molecule Nanopore Sequencing in Thalassemia Testing: A Prospective Double-blind, Multicenter Study

Xiang, J.; Zhu, B.; Xu, H.; Chen, Y.; Sun, X.; xiang, r.; Zhao, Y.; Liu, W.; Zhang, L.; He, J.; liu, j.; Chen, Y.; Fan, Z.; Zhang, H.; Tan, J.; Pang, L.; Shi, L.; Kong, Y.; Cai, A.

2026-06-09 hematology 10.64898/2026.06.09.26354559 medRxiv

Top 0.1%

12.7%

Background Thalassemia is one of the most common monogenic disorders worldwide. Current screening strategies combining hematological testing with molecular assays still carry a risk of missed diagnoses and suboptimal efficiency, particularly for complex structural variants and rare mutations. Methods In this prospective double-blind, multicenter cohort study of 3,842 participants (3,362 pregnant women and 480 male partners), we conducted a head-to-head comparison to systematically evaluate the incremental clinical value and detection performance of single-molecule nanopore sequencing in thalassemia (SMITH) against conventional hematological testing and next-generation sequencing (NGS). Findings The overall concordance rate between NGS and SMITH was 98.6% (3789/3842). The discrepant cases (n=53) were directly attributed to the superior detection capabilities of SMITH, which successfully identified complex structural rearrangements (including 45 -globin gene triplications and four HK alleles) that were missed by NGS. Furthermore, SMITH accurately detected four rare variants (c.134_135insT/, c.-22(C>T)/, βN/βc.316-290delinsAGGGCAATAATTT and β3.5 kb deletion/βN) and resolved ten trans and three cis configurations within the globin gene allele. Clinically, these technical advantages translated to a 9.3% (5/54) increase in the detection rate of high-risk prenatal couples, effectively preventing one birth affected by moderate-to-severe thalassemia. Additionally, SMITH corrected a diagnostic discrepancy in one case (HK vs. -3.7), sparing the couple from an unnecessary invasive procedure. Interpretation Our findings demonstrate that SMITH provides a powerful platform for resolving globin gene rearrangements, detecting rare variants, and enabling direct haplotype phasing. By effectively eliminating diagnostic blind spots, SMITH is expected to become an optimal method for thalassemia prevention programs.
Funding This study was supported by Chinese National Natural Science Foundation Projects 81760037 and 82271894.
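
The headline percentages in this abstract follow directly from the counts it reports, and a short check makes the arithmetic explicit. The variable names below are illustrative only; the counts (3,842 participants, 3,789 concordant results, 5 additional high-risk couples out of 54) are taken from the abstract.

```python
# Counts quoted in the abstract.
total_participants = 3842
concordant = 3789

# Discrepant cases are those where NGS and SMITH disagreed.
discrepant = total_participants - concordant
concordance_rate = concordant / total_participants

# Reported increase in detection of high-risk prenatal couples (5 of 54).
detection_increase = 5 / 54

print(f"Discrepant cases: {discrepant}")
print(f"Concordance rate: {concordance_rate:.1%}")
print(f"Detection increase: {detection_increase:.1%}")
```

The three values agree with the figures given in the abstract (53 discrepant cases, 98.6% concordance, and a 9.3% increase).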

18

Menopause in the All of Us Research Program: A Descriptive Summary of Electronic Health Record and Survey Response across Sociodemographic Characteristics

Staples, J. W.; White, S. L.; Giacalone, A.; Pozdeyev, N.; Sammel, M. D.; Stranger, B. E.; Valencia, C. I.; Santoro, N.; Hendricks, A. E.

2026-04-25 sexual and reproductive health 10.64898/2026.04.17.26351129 medRxiv

Top 0.1%

12.7%

Objective: Menopause is a significant physiological transition with implications for health outcomes (e.g., cardiometabolic), yet gaps remain in understanding the menopause transition, including how menopause timing and type influence health outcomes. Large-scale cohort studies in midlife (age ~40-60) females, including the All of Us Research Program (AoURP), provide opportunities to study menopause across diverse populations and data modalities. We characterized menopause-related data in AoURP, focusing on age distributions and concordance between EHR diagnosis codes and self-reported survey responses. Methods: We analyzed menopause-related survey, EHR diagnostic code, and genomic data among ~396,000 participants in AoURP with female sex. We summarized menopause data across modalities, overlap between survey, EHR, and genomic data, and age distributions overall and across sociodemographic characteristics. Results: Among ~396,000 females, surveys captured ~193,000 menopause observations, nearly seven times more than structured EHR diagnoses (~28,000), suggesting under-ascertainment in EHR data. Nearly all females (~99%) with an EHR menopause diagnosis also reported menopause in the survey. Approximately 22,000 participants had overlapping EHR, survey, and genomic menopause-related data. Survey-based age patterns matched expectations, with participants <40 years predominantly reporting pre-menopausal status and those >60 years predominantly reporting post-menopausal status. A small subset (N≈1,700; 4%) aged >70 years reported no menopause, suggesting response or recall bias. EHR menopause codes were concentrated after age 45 years, with a notable spike at age 65. Modest differences in survey-based menopause age distributions were observed by sociodemographic characteristics (e.g., race, ancestry). Conclusions: These findings inform sampling strategies, power calculations, phenotype definition, and study design for menopause research using AoURP.

19

Violence exposure and mental health problems among school-aged children in a South African birth cohort

Bailey, M.; Hammerton, G.; Fairchild, G.; Tsunga, L.; Hoffman, N.; Burd, T.; Shadwell, R.; Danese, A.; Armour, C.; Zar, H. J.; Stein, D. J.; Donald, K. A.; Halligan, S. L.

2026-04-22 psychiatry and clinical psychology 10.64898/2026.04.20.26351289 medRxiv

Top 0.1%

12.0%

Objective: There is little longitudinal research investigating links between violence exposure and mental disorders among children in low- and middle-income countries (LMICs), despite high rates of violence. We examined cross-sectional and longitudinal violence-mental health associations among children in a large South African birth cohort, the Drakenstein Child Health Study, including direct clinical interviews capturing children's mental disorders. Method: In this birth cohort (N=974), we assessed lifetime violence exposure and four subtypes (witnessed community, community victimization, witnessed domestic, domestic victimization) at ages 4.5 and 8 years via caregiver reports. At 8 years, caregivers completed the Child Behaviour Checklist, and psychiatric disorders were assessed using the Mini-International Neuropsychiatric Interview for Children and Adolescents, a self-report measure. We tested for associations using linear/logistic regressions, adjusted for confounders. Results: Most children (91%) had experienced violence by 8 years. Cross-sectionally, total violence exposure was associated with total (B=0.49 [95% CI 0.32, 0.66]), internalizing (0.32 [0.17, 0.47]), and externalizing problems (0.46 [0.31, 0.61]), and with increased odds of disorder at 8 years (aOR=1.09 [1.05, 1.13]). Longitudinally, total violence exposure up to 4.5 years was associated with total (B=0.27 [0.03, 0.52]), internalizing (0.24 [0.04, 0.44]), and externalizing scores (0.23 [0.008, 0.45]) at 8 years, but not with increased risk of psychiatric disorders. The strongest and most consistent associations were observed for domestic versus community violence subtypes. Conclusion: Our strong cross-sectional but weaker longitudinal findings suggest that recent violence exposures may be more critical than early exposures for children's mental health. Longitudinal exploration of other violence-affected LMIC populations is urgently needed.

20

Life Course Socioeconomic Position and health in older adulthood age: A Formal Mediation Analysis in the 1958 British Birth Cohort

Guo, Y.; Pelikh, A.; Ploubidis, G. B.; Goodman, A.

2026-03-25 epidemiology 10.64898/2026.03.23.26349085 medRxiv

Top 0.1%

11.9%

Background Childhood socioeconomic position (SEP) is a key determinant of later-life health. Understanding the extent to which adult SEP mediates this association into early old age is important for explaining how health inequalities are propagated across generations and how they might be addressed in later life. To our knowledge, no prospective study has examined whether childhood SEP remains associated with health at the threshold of older age and the extent to which any such association is mediated by adult SEP. Methods We used data from the 1958 British Birth Cohort, a prospective study that has followed participants since birth, drawing on earlier data collected at birth and ages 33 and 55 years and newly collected data from the age 62 sweep. Using interventional causal mediation analyses, we assessed whether adult occupational class, education, housing tenure, and income mediate associations between childhood social class (manual vs non-manual) and health at age 62 (self-rated health, C-reactive protein [CRP], cholesterol ratio, glycated hemoglobin [HbA1c], and N-terminal pro-B-type natriuretic peptide [NT-proBNP]). Findings Associations between childhood SEP and self-rated health, CRP, cholesterol ratio, and HbA1c persisted after accounting for adult SEP. Mediation was outcome-specific and differed by sex. Among men, occupational class mediated 39% of the association with self-rated health (indirect effect RR 0.90, 95% CI 0.86, 0.95) and education mediated 27% (0.93, 0.90, 0.96). Among women, education mediated 10% (0.95, 0.91, 0.98) and housing tenure mediated 6% (0.97, 0.94, 0.99). Indirect effects for CRP were smaller, and mediation was minimal for cholesterol ratio, HbA1c, and NT-proBNP. Interpretation Population-level improvements in adult SEP could reduce, but are unlikely to eliminate, later-life health inequalities associated with childhood SEP.
Reducing these inequalities will require policies that address disadvantage in early life and improve adult financial and employment conditions. Funding UK Economic and Social Research Council